feat(sinker): Ship 8 — Nv24/Nv42 RGBA + Strategy A RGB→RGBA fan-out by uqio · Pull Request #20 · Findit-AI/colconv

uqio · 2026-04-26T06:36:36Z

Tranche 4b of Ship 8 sink-side RGBA. Adds Nv24 / Nv42 (semi-planar 4:4:4) RGBA output via the dual-const-generic <SWAP_UV, ALPHA> template established by PR #17 (NV12 / NV21), and retro-applies a Strategy A combined RGB→RGBA fan-out to all 8 wired families so callers attaching both with_rgb and with_rgba no longer pay the per-pixel YUV→RGB math twice — addresses the Copilot review finding from PR #19 (src/sinker/mixed.rs:1648).

Scope

#	Tranche	Formats	Status
1	4:2:0 planar	`Yuv420p`	✅ shipped (PR #16)
2	4:2:0 semi-planar	`Nv12`, `Nv21`	✅ shipped (PR #17)
3	4:2:2 planar + semi-planar	`Yuv422p`, `Nv16`	✅ shipped (PR #18)
4a	4:4:4 planar	`Yuv444p`	✅ shipped (PR #19)
4b	4:4:4 semi-planar	`Nv24`, `Nv42`	⏳ this PR + Strategy A retro-applied to all 8 wired families
4c	4:4:0 planar	`Yuv440p`	next — wiring-only (reuses `yuv_444_to_rgba_row`)
5	High-bit-depth 4:2:0	`Yuv420p9/10/12/14/16`, `P010/P012/P016`
6	High-bit-depth 4:2:2	`Yuv422p9/10/12/14/16`, `Yuv440p10/12`, `P210/P212/P216`
7	High-bit-depth 4:4:4	`Yuv444p9/10/12/14/16`, `P410/P412/P416`

Usage:

```rust
use colconv::{
    frame::Nv24Frame,
    sinker::MixedSinker,
    yuv::{Nv24, nv24_to},
    ColorMatrix,
};

let frame = Nv24Frame::new(&y_plane, &uv_plane, w, h, w, 2 * w);
let mut rgb = vec![0u8; (w * h * 3) as usize];
let mut rgba = vec![0u8; (w * h * 4) as usize];
let mut sinker = MixedSinker::<Nv24>::new(w as usize, h as usize)
    .with_rgb(&mut rgb)?
    .with_rgba(&mut rgba)?;

// Both buffers requested → YUV→RGB math runs once, RGBA derived via the
// Strategy A fan-out (no double per-pixel cost).
nv24_to(&frame, /* full_range = */ true, ColorMatrix::Bt709, &mut sinker)?;
```

What's in this PR

Public API

`MixedSinker::<Nv24>::with_rgba(&mut [u8])` / `set_rgba` and `MixedSinker::<Nv42>::with_rgba` / `set_rgba` — format-specific impl blocks.
`row::nv24_to_rgba_row(...)` and `row::nv42_to_rgba_row(...)` — public dispatchers paralleling the RGB variants.

Kernel work — NV24/NV42 RGBA

Mirrors PR #17 (NV12/NV21) shape: dual const generic `<const SWAP_UV: bool, const ALPHA: bool>` on a single shared `nv24_or_nv42_to_rgb_or_rgba_row_impl` kernel per backend, with 4 thin wrappers (NV24 RGB / NV42 RGB / NV24 RGBA / NV42 RGBA) forwarding the 4 `(SWAP_UV, ALPHA)` combinations. The compiler monomorphizes into 4 separate functions; the `if ALPHA` branch and the unused alpha-vector splat are DCE'd at each call site.

File	What's added
`row/scalar.rs`	NV24/NV42 RGBA + `<SWAP_UV, ALPHA>` template + `expand_rgb_to_rgba_row` helper
`arch/neon.rs`	NV24/NV42 RGBA; uses `vst4q_u8` when `ALPHA = true`, `vst3q_u8` otherwise
`arch/x86_sse41.rs`	NV24/NV42 RGBA; reuses `write_rgba_16` from PR #16
`arch/x86_avx2.rs`	NV24/NV42 RGBA; reuses `write_rgba_32` from PR #16
`arch/x86_avx512.rs`	NV24/NV42 RGBA; reuses `write_rgba_64` from PR #16
`arch/wasm_simd128.rs`	NV24/NV42 RGBA; reuses wasm `write_rgba_16` from PR #16
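
The dual-const-generic shape described above can be sketched roughly as follows. This is a minimal illustration, not the crate's kernel: `row_impl`, the simplified signatures, and the placeholder "conversion" (R = V, G = Y, B = U, chosen only so the effect of `SWAP_UV` is visible) are all assumptions made for the example.

```rust
/// Simplified stand-in for the shared kernel. The "conversion" is a
/// placeholder (R = V, G = Y, B = U), NOT a real color matrix; only the
/// const-generic dispatch shape is the point here.
#[inline(always)]
fn row_impl<const SWAP_UV: bool, const ALPHA: bool>(
    y: &[u8],
    uv_or_vu: &[u8],
    out: &mut [u8],
    width: usize,
) {
    // Constant within each monomorphized copy of this function.
    let bpp: usize = if ALPHA { 4 } else { 3 };
    assert!(y.len() >= width, "luma row too short");
    assert!(uv_or_vu.len() >= 2 * width, "chroma row too short");
    assert!(out.len() >= width * bpp, "out row too short");

    for x in 0..width {
        // 4:4:4 semi-planar: one interleaved chroma pair per pixel.
        let (first, second) = (uv_or_vu[2 * x], uv_or_vu[2 * x + 1]);
        let (u, v) = if SWAP_UV { (second, first) } else { (first, second) };
        let px = &mut out[x * bpp..(x + 1) * bpp];
        px[0] = v;
        px[1] = y[x];
        px[2] = u;
        if ALPHA {
            // Only present in the two ALPHA = true instantiations.
            px[3] = 0xFF;
        }
    }
}

// Four thin wrappers, one per (SWAP_UV, ALPHA) combination.
pub fn nv24_to_rgb_row(y: &[u8], uv: &[u8], out: &mut [u8], width: usize) {
    row_impl::<false, false>(y, uv, out, width)
}
pub fn nv42_to_rgb_row(y: &[u8], vu: &[u8], out: &mut [u8], width: usize) {
    row_impl::<true, false>(y, vu, out, width)
}
pub fn nv24_to_rgba_row(y: &[u8], uv: &[u8], out: &mut [u8], width: usize) {
    row_impl::<false, true>(y, uv, out, width)
}
pub fn nv42_to_rgba_row(y: &[u8], vu: &[u8], out: &mut [u8], width: usize) {
    row_impl::<true, true>(y, vu, out, width)
}
```

Because both parameters are compile-time constants, each wrapper gets its own specialized copy of the loop, so the byte-order and alpha decisions cost nothing per pixel.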

Strategy A — combined RGB→RGBA fan-out

Tranches 1–4a wired RGB and RGBA as independent kernel calls — when a caller attached both `with_rgb` and `with_rgba`, `MixedSinker::process` ran the YUV→RGB per-pixel math twice. Copilot review of PR #19 (`src/sinker/mixed.rs:1648`) flagged this.

This PR addresses it by:

Adding `pub(crate) fn expand_rgb_to_rgba_row(rgb, rgba_out, width)` in `row/scalar.rs` — memory-bound copy + `0xFF` alpha pad.
Reworking each `MixedSinker::process` across all 8 wired families (Yuv420p, Yuv422p, Yuv444p, Nv12, Nv16, Nv21, Nv24, Nv42). Output mode resolution per row:
- RGBA-only (no RGB / HSV): dedicated `*_to_rgba_row` kernel directly into the output buffer.
- RGB / HSV (± RGBA): RGB kernel once into `rgb_row` (or `rgb_scratch`), then HSV derivation if requested, then `expand_rgb_to_rgba_row` if RGBA also requested.
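
The per-row mode resolution above could look roughly like this. It is a hedged sketch only: `RowOutputs`, `gray_kernel` (a stand-in that replicates luma instead of doing real YUV→RGB math), and the returned kernel-run count are hypothetical, and the HSV / `rgb_scratch` branch is left out for brevity.

```rust
/// Hypothetical per-row outputs; a field is `None` when no buffer is attached.
pub struct RowOutputs<'a> {
    pub rgb: Option<&'a mut [u8]>,
    pub rgba: Option<&'a mut [u8]>,
}

/// Stand-in for the expensive YUV→RGB(A) kernel: copies luma into R, G, B.
fn gray_kernel(y: &[u8], out: &mut [u8], width: usize, bpp: usize) {
    for (px, &luma) in out[..width * bpp].chunks_exact_mut(bpp).zip(&y[..width]) {
        px[..3].fill(luma);
        if bpp == 4 {
            px[3] = 0xFF;
        }
    }
}

/// Memory-bound fan-out: copy three bytes per pixel, pad alpha with 0xFF.
pub fn expand_rgb_to_rgba_row(rgb: &[u8], rgba_out: &mut [u8], width: usize) {
    for (src, dst) in rgb[..width * 3]
        .chunks_exact(3)
        .zip(rgba_out[..width * 4].chunks_exact_mut(4))
    {
        dst[..3].copy_from_slice(src);
        dst[3] = 0xFF;
    }
}

/// Returns how many times the expensive kernel ran for this row.
pub fn process_row(y: &[u8], width: usize, out: RowOutputs<'_>) -> usize {
    match (out.rgb, out.rgba) {
        (None, None) => 0,
        // RGBA-only: dedicated RGBA kernel straight into the output buffer.
        (None, Some(rgba)) => {
            gray_kernel(y, rgba, width, 4);
            1
        }
        // RGB (± RGBA): RGB kernel once, then fan out if RGBA is attached.
        (Some(rgb), rgba) => {
            gray_kernel(y, rgb, width, 3);
            if let Some(rgba) = rgba {
                expand_rgb_to_rgba_row(&*rgb, rgba, width);
            }
            1
        }
    }
}
```

In every branch the expensive kernel runs at most once per row, which is the property the rework is after.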

Effective memory traffic for the both-buffers case: 3W RGB write + 3W L1-hot read + 4W RGBA write. Discounting the L1-hot read, that is ≈ 7W of write traffic, the same as a hypothetical combined kernel ("Strategy B"), at ~1/10th the LOC.

Strategy B (a third const generic on every kernel doing both stores per pixel) is documented as a future follow-up in `docs/color-conversion-functions.md` § Ship 8 — only worth the LOC cost if profiling later shows the L1-readback step matters.

MixedSinker integration

`with_rgba` / `set_rgba` declared on format-specific impl blocks (per PR #16 safety pattern) — attaching RGBA to a sink that doesn't write it is a compile error rather than a silent stale-buffer bug. The `compile_fail` doctest negative example moved forward from `Nv24` to `Yuv440p` (next not-yet-wired format).
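
A minimal sketch of that pattern, assuming a hypothetical `Sinker<F>` type with marker structs and string errors (the real `MixedSinker` and its error type are richer): `with_rgba` exists only on the impl block for a wired format, so calling it on an unwired format is rejected by the compiler.

```rust
use std::marker::PhantomData;

/// Marker types standing in for source formats.
pub struct Nv24; // wired for RGBA in this sketch
pub struct Yuv440p; // not yet wired in this sketch

/// Hypothetical, stripped-down sinker.
pub struct Sinker<'a, F> {
    rgba: Option<&'a mut [u8]>,
    width: usize,
    height: usize,
    _format: PhantomData<F>,
}

// Available for every format.
impl<'a, F> Sinker<'a, F> {
    pub fn new(width: usize, height: usize) -> Self {
        Sinker { rgba: None, width, height, _format: PhantomData }
    }

    pub fn has_rgba(&self) -> bool {
        self.rgba.is_some()
    }
}

// Available only for formats whose sink actually writes RGBA.
impl<'a> Sinker<'a, Nv24> {
    pub fn with_rgba(mut self, buf: &'a mut [u8]) -> Result<Self, &'static str> {
        let need = self
            .width
            .checked_mul(self.height)
            .and_then(|px| px.checked_mul(4))
            .ok_or("geometry overflow")?;
        if buf.len() < need {
            return Err("rgba buffer too short");
        }
        self.rgba = Some(buf);
        Ok(self)
    }
}

// There is no `with_rgba` on `Sinker<'_, Yuv440p>`, so
// `Sinker::<Yuv440p>::new(2, 2).with_rgba(&mut buf)` would not compile.
```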

Doc updates

`docs/color-conversion-functions.md` § Ship 8 — tranche tracker updated (4a ✅ shipped, 4b ⏳ this PR, 4c next), new "Combined RGB + RGBA path: Strategy A (shipped) + Strategy B (deferred)" subsection enumerating the tradeoff space.
`docs/color-conversion-functions.md` § 2a (new) — "Real-world asset library format frequency" table calibrating § 2's priority tiers against post-production MAM / streaming / VFX / live-broadcast archetypes. Adds rows for AV1, AVC-Intra / Canon XF-AVC, camera RAW family, stills, and an "Other / unaccounted" residual; per-row notes flag the bimodal cases (DNxHR, ProRes 422). Tranche 6 (10-bit 4:2:2 RGBA) ranks as the single biggest unlock at 30–55% combined for post-production MAM workloads.

Tests

+16 lib tests on aarch64 (475 vs. 459 in PR #19); per-backend tests on the other 4 SIMD backends fire on their matching CI runners.

Layer	Tests added
Scalar `expand_rgb_to_rgba_row`	3: alpha-pad / RGB-preserve, only-first-N-pixels invariant, zero-width no-op
Format-level Nv24 RGBA	4: gray-to-gray + opaque alpha, RGB-byte invariant, buffer-too-short, random-YUV SIMD parity (1922×4 frame, all 4 matrices × both ranges)
Format-level Nv42 RGBA	4: same shape
Cross-format Strategy A umbrella	1: `strategy_a_rgb_and_rgba_byte_identical_for_all_wired_families` exercises all 8 `process` impls and asserts `rgba[i*4..i*4+3] == rgb[i*3..i*3+3]` with `rgba[i*4+3] == 0xFF` per pixel
NEON per-backend (verified locally)	4: 16-pixel all-matrices + varied widths (1, 3, 15, 17, 32, 33, 1920, 1921 — odd widths validate the 4:4:4 no-parity contract) × NV24 / NV42
SSE4.1 per-backend (CI)	4: same shape
AVX2 per-backend (CI)	4: 32-pixel main loop + tail widths × NV24 / NV42
AVX-512 per-backend (CI)	4: 64-pixel main loop + tail widths × NV24 / NV42
wasm simd128 per-backend (CI)	4: 16-pixel + tail widths × NV24 / NV42
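
The byte-identity invariant checked by the Strategy A umbrella test in the table above can be expressed as a small helper. This is an illustrative sketch; `first_strategy_a_violation` is a hypothetical name, not a function from the crate's test suite.

```rust
/// Returns the index of the first pixel that violates the Strategy A
/// invariant (RGBA color bytes equal the RGB bytes, alpha is 0xFF), or
/// `None` when all `pixels` pixels agree.
pub fn first_strategy_a_violation(rgb: &[u8], rgba: &[u8], pixels: usize) -> Option<usize> {
    assert!(rgb.len() >= pixels * 3, "rgb buffer too short");
    assert!(rgba.len() >= pixels * 4, "rgba buffer too short");
    (0..pixels).find(|&i| {
        rgba[i * 4..i * 4 + 3] != rgb[i * 3..i * 3 + 3] || rgba[i * 4 + 3] != 0xFF
    })
}
```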

Per-backend tests bypass the dispatcher (call each backend's `unsafe nv24_to_rgba_row` / `nv42_to_rgba_row` directly under runtime feature detection) so on AVX-512-capable CI runners all three x86 paths run.
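
As a rough sketch of the runtime-detection side of this, the helper below lists which backends a per-backend test could exercise on the current machine. `detected_backends` and the backend labels are hypothetical, and the exact feature set each real kernel requires (for example which AVX-512 subsets) is an assumption here.

```rust
/// Lists the backends whose CPU features are present at runtime.
/// The scalar path is always available.
pub fn detected_backends() -> Vec<&'static str> {
    let mut backends = vec!["scalar"];

    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        // On an AVX-512-capable machine all three checks typically succeed,
        // so all three x86 kernels can be called directly.
        if std::arch::is_x86_feature_detected!("sse4.1") {
            backends.push("sse4.1");
        }
        if std::arch::is_x86_feature_detected!("avx2") {
            backends.push("avx2");
        }
        if std::arch::is_x86_feature_detected!("avx512f") {
            backends.push("avx512");
        }
    }

    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("neon") {
            backends.push("neon");
        }
    }

    backends
}
```

A per-backend test would then guard each direct `unsafe` kernel call with the matching check and skip the backend when the feature is absent.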

Local results (aarch64 macOS): 475 lib tests + 1 doctest pass; wasm32 + x86_64 cross-targets compile clean.

What's deferred

Tranche 4c — `Yuv440p` — wiring-only PR, reuses `yuv_444_to_rgba_row`.
Tranches 5–7 — high-bit-depth families.
`with_rgba_u16` ships in tranches 5–7.
YUVA source frames (Ship 8b) — independent follow-up.
Strategy B (combined kernel writing both stores per pixel) — future optimization, only if profiling shows the L1-readback step matters.
Cleanup PR after merge — split inline `mod tests` blocks out of large source files (`mixed.rs`, per-arch backends, `scalar.rs`); also covers visibility tightening on `_impl` functions (Copilot finding #2 — kept `pub(crate)` here for consistency with NV12/NV21's existing pattern; should land as a sweep across all `_impl`s) and RGBA-plane bounds-check helper extraction across all 8 `process` impls (Copilot finding #4).

Test plan

CI green on `test`, `test-sde-avx512`, `cross`, `coverage`, `clippy`, `build`, `miri-*` jobs.
Per-tier coverage matrix exercises SSE4.1 / AVX2 / scalar paths via existing `colconv_disable_*` rustflags.
Verify Nv24 / Nv42 → both-buffers (RGB + RGBA) pipeline end-to-end with a real frame (gray + non-gray patches).
`cargo doc --lib --no-deps` clean (no new doc warnings vs. main).

🤖 Generated with Claude Code

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR adds RGBA output support for NV24/NV42 and introduces a “Strategy A” optimization to avoid running YUV→RGB math twice when both RGB and RGBA outputs are attached.

Changes:

Add NV24/NV42 RGBA row converters (scalar + SIMD dispatch) and wire them into MixedSinker.
Implement Strategy A fan-out (RGB -> RGBA expansion) to reuse the RGB kernel output when RGBA is also requested.
Add extensive tests covering NV24/NV42 RGBA behavior and SIMD-vs-scalar equivalence.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

File	Description
src/sinker/mixed.rs	Wires NV24/NV42 RGBA buffers, adds Strategy A selection logic, and adds sinker-level tests.
src/row/scalar.rs	Adds NV24/NV42 scalar RGBA kernels and the `expand_rgb_to_rgba_row` helper + tests.
src/row/mod.rs	Exposes NV24/NV42 RGBA row APIs with SIMD dispatch.
src/row/arch/x86_sse41.rs	Adds SSE4.1 NV24/NV42 RGBA kernels and equivalence tests.
src/row/arch/x86_avx2.rs	Adds AVX2 NV24/NV42 RGBA kernels and equivalence tests.
src/row/arch/x86_avx512.rs	Adds AVX-512 NV24/NV42 RGBA kernels and equivalence tests.
src/row/arch/wasm_simd128.rs	Adds wasm simd128 NV24/NV42 RGBA kernels and equivalence tests.
src/row/arch/neon.rs	Adds NEON NV24/NV42 RGBA kernels and equivalence tests.

Copilot · 2026-04-26T06:45:04Z

+  ///
+  /// See [`MixedSinker::<Nv24>::with_rgba`] for the same rationale and
+  /// constraints; Nv42 differs only in chroma byte order (V before U).
+  /// Returns `Err(RgbaBufferTooShort)` if `buf.len() < width × height × 4`.


The Nv42::with_rgba doc currently mentions only RgbaBufferTooShort, but the implementation calls self.frame_bytes(4)?, which can also return GeometryOverflow (e.g., on 32-bit targets). Update the doc comment to match the actual error surface (similar to the Nv24 docs above).

Suggested change

-  /// Returns `Err(RgbaBufferTooShort)` if `buf.len() < width × height × 4`.
+  /// Returns `Err(GeometryOverflow)` if `width × height × 4` cannot be
+  /// represented on the current target, or `Err(RgbaBufferTooShort)` if
+  /// `buf.len() < width × height × 4`.
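
To illustrate why both errors belong in the doc comment, here is a minimal sketch of an overflow-checked size computation. The `SinkError` enum, the free-function form of `frame_bytes`, and the fields of `RgbaBufferTooShort` are assumptions for the example; only the `GeometryOverflow { width, height, channels }` shape is taken from the diff quoted in this thread.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum SinkError {
    GeometryOverflow { width: usize, height: usize, channels: usize },
    RgbaBufferTooShort { expected: usize, actual: usize },
}

/// Computes `width * height * channels`, failing instead of wrapping when
/// the product does not fit in `usize` (more likely on 32-bit targets).
pub fn frame_bytes(width: usize, height: usize, channels: usize) -> Result<usize, SinkError> {
    width
        .checked_mul(height)
        .and_then(|pixels| pixels.checked_mul(channels))
        .ok_or(SinkError::GeometryOverflow { width, height, channels })
}

/// Validates an RGBA buffer length; either error can surface to the caller.
pub fn check_rgba_buffer(buf_len: usize, width: usize, height: usize) -> Result<(), SinkError> {
    let expected = frame_bytes(width, height, 4)?;
    if buf_len < expected {
        return Err(SinkError::RgbaBufferTooShort { expected, actual: buf_len });
    }
    Ok(())
}
```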

Copilot · 2026-04-26T06:45:05Z

+///   `out.len() >= width * (if ALPHA { 4 } else { 3 })`.
 #[cfg_attr(not(tarpaulin), inline(always))]
-fn nv24_or_nv42_to_rgb_row_impl<const SWAP_UV: bool>(
+pub(crate) fn nv24_or_nv42_to_rgb_or_rgba_row_impl<const SWAP_UV: bool, const ALPHA: bool>(


This shared kernel function appears to be an internal implementation detail (and was previously private). If it’s not used outside row::scalar, consider making it fn (module-private) again to avoid unnecessarily widening the crate-internal API surface and reduce coupling.

Suggested change

-pub(crate) fn nv24_or_nv42_to_rgb_or_rgba_row_impl<const SWAP_UV: bool, const ALPHA: bool>(
+fn nv24_or_nv42_to_rgb_or_rgba_row_impl<const SWAP_UV: bool, const ALPHA: bool>(

Copilot · 2026-04-26T06:45:05Z

+  for x in 0..width {
+    rgba_out[x * 4] = rgb[x * 3];
+    rgba_out[x * 4 + 1] = rgb[x * 3 + 1];
+    rgba_out[x * 4 + 2] = rgb[x * 3 + 2];
+    rgba_out[x * 4 + 3] = 0xFF;


expand_rgb_to_rgba_row is likely on a hot path. Consider iterating via chunks_exact(3) zipped with chunks_exact_mut(4) (or a similar approach) to avoid repeated index multiplications and help the compiler generate a tighter loop.

Suggested change

-  for x in 0..width {
-    rgba_out[x * 4] = rgb[x * 3];
-    rgba_out[x * 4 + 1] = rgb[x * 3 + 1];
-    rgba_out[x * 4 + 2] = rgb[x * 3 + 2];
-    rgba_out[x * 4 + 3] = 0xFF;
+  for (rgb_px, rgba_px) in rgb[..width * 3]
+    .chunks_exact(3)
+    .zip(rgba_out[..width * 4].chunks_exact_mut(4))
+  {
+    rgba_px[0] = rgb_px[0];
+    rgba_px[1] = rgb_px[1];
+    rgba_px[2] = rgb_px[2];
+    rgba_px[3] = 0xFF;

Copilot · 2026-04-26T06:45:06Z

+    if let Some(buf) = rgba.as_deref_mut() {
+      let rgba_plane_end =
+        one_plane_end
+          .checked_mul(4)
+          .ok_or(MixedSinkerError::GeometryOverflow {
+            width: w,
+            height: h,
+            channels: 4,
+          })?;
+      let rgba_plane_start = one_plane_start * 4;
+      expand_rgb_to_rgba_row(rgb_row, &mut buf[rgba_plane_start..rgba_plane_end], w);
+    }


The RGBA-plane bounds/overflow check + slice computation is duplicated across multiple PixelSink impls (Yuv420p/Yuv422p/Yuv444p/Nv12/Nv16/Nv21/Nv24/Nv42). Consider extracting a small helper (e.g., a method that returns the mutable RGBA row slice for [one_plane_start..one_plane_end]) to reduce repetition and the risk of future drift between formats.
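
One possible shape for such a helper, sketched under assumptions: `rgba_row_mut`, `RowSliceError`, and the `RowOutOfBounds` variant are hypothetical names, and `one_plane_start` / `one_plane_end` are taken to be the row's pixel offsets as in the diff above.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum RowSliceError {
    GeometryOverflow { width: usize, height: usize, channels: usize },
    RowOutOfBounds,
}

/// Returns the mutable RGBA bytes for the row spanning pixel offsets
/// `[one_plane_start..one_plane_end]`, scaling both offsets by 4 channels.
pub fn rgba_row_mut(
    buf: &mut [u8],
    one_plane_start: usize,
    one_plane_end: usize,
    width: usize,
    height: usize,
) -> Result<&mut [u8], RowSliceError> {
    let scale = |offset: usize| {
        offset
            .checked_mul(4)
            .ok_or(RowSliceError::GeometryOverflow { width, height, channels: 4 })
    };
    let start = scale(one_plane_start)?;
    let end = scale(one_plane_end)?;
    // `get_mut` rejects both a too-short buffer and an inverted range.
    buf.get_mut(start..end).ok_or(RowSliceError::RowOutOfBounds)
}
```

Each `process` impl could then call the helper once and pass the returned slice to `expand_rgb_to_rgba_row`, keeping the overflow handling in a single place.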

codecov · 2026-04-26T06:54:45Z

Codecov Report

❌ Patch coverage is 92.59259% with 22 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
src/row/arch/neon.rs	75.00%	14 Missing ⚠️
src/row/mod.rs	71.42%	8 Missing ⚠️

Copilot

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

Copilot · 2026-04-26T06:57:47Z

  debug_assert!(uv_or_vu.len() >= 2 * width, "chroma row too short");
-  debug_assert!(rgb_out.len() >= width * 3, "rgb_out row too short");
+  let bpp: usize = if ALPHA { 4 } else { 3 };
+  debug_assert!(out.len() >= width * bpp, "out row too short for {bpp}bpp");


The debug_assert! message uses {bpp} inside a string literal, so the actual bpp value will not be interpolated. Consider using a formatted message (e.g., with {} + bpp) so debug builds report the concrete expected stride.

Suggested change

-  debug_assert!(out.len() >= width * bpp, "out row too short for {bpp}bpp");
+  debug_assert!(out.len() >= width * bpp, "out row too short for {}bpp", bpp);
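
For context on the two spellings, a small check assuming a toolchain at Rust 1.58 or newer: `format!` captures an in-scope identifier written inline as `{bpp}`, so both forms below produce the same text. Whether the same holds for the message of the `debug_assert!` in this diff depends on the crate's edition, which this thread does not state (in the 2021 edition the assert macros always treat the message as a format string). The function names are hypothetical.

```rust
/// Inline captured identifier (Rust 1.58+).
pub fn short_row_message_inline(bpp: usize) -> String {
    format!("out row too short for {bpp}bpp")
}

/// Positional argument, as in the suggested change.
pub fn short_row_message_positional(bpp: usize) -> String {
    format!("out row too short for {}bpp", bpp)
}
```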

update

f639f42

al8n requested a review from Copilot April 26, 2026 06:36

Copilot AI reviewed Apr 26, 2026

Copilot started reviewing on behalf of al8n April 26, 2026 06:50

al8n requested a review from Copilot April 26, 2026 06:53

Copilot started reviewing on behalf of al8n April 26, 2026 06:54

Copilot AI reviewed Apr 26, 2026

al8n changed the title ~~update~~ feat(sinker): Ship 8 — Nv24/Nv42 RGBA + Strategy A RGB→RGBA fan-out Apr 26, 2026

uqio merged commit 3dc020d into main Apr 26, 2026
15 of 75 checks passed

uqio deleted the feat/ship8-rgba-nv24-nv42 branch April 26, 2026 07:05

al8n mentioned this pull request Apr 26, 2026

refactor(tests): split inline test mods into sibling tests.rs files + 2 PR #20 review fixes #21

Merged

8 tasks

uqio added a commit that referenced this pull request Apr 26, 2026

refactor(tests): split inline test mods into sibling tests.rs files +…

9596d14

… 2 PR #20 review fixes (#21)

al8n mentioned this pull request Apr 26, 2026

feat(sinker): Ship 8 — Yuv440p RGBA wiring (reuses Yuv444p kernels) #22

Merged

4 tasks

Copilot AI mentioned this pull request Apr 26, 2026

chore: cleanup impl visibility and helpers #23

Merged

This was referenced Apr 26, 2026

feat(row): Ship 8 — high-bit 4:2:0 RGBA scalar (SIMD lands in 5a/5b) #24

Merged

Ship 8 Tranche 5b: high-bit 4:2:0 RGBA u16 SIMD + sinker integration #26

Merged

Ship 8 Tranche 7c: high-bit 4:4:4 RGBA u16 SIMD + sinker integration #31

Merged
